ABSTRACT


An on-line, general-purpose, fact-retrieval system is presented
which employs a classificatory data structuring technique. The
technique embraces the basic concept of hierarchical classification
of data and provides users with multiple avenues of access to a data
file. Additionally, the data file may be partitioned into unrelated
data sets.

I. INTRODUCTION


The term "information retrieval" and the initials "IR" were
coined by the editors of Fortune about ten years ago. However,
Vannevar Bush first formally declared the necessity for an
information retrieval discipline in his "As We May Think" article,
which was written for Atlantic Monthly in 1945. The United States
Government and those people involved in Library Science were truly
the first innovators of this discipline in the mid-fifties. The
technological explosion being felt at that time prompted government
agencies and library scientists to search for more efficient systems
for indexing, storing, and retrieving documents. The primary concern
was the assurance that vital technical information would be available
to all possible users. The discipline of information retrieval as we
know it today emerged as a result of this technological explosion.

Information retrieval has been defined in numerous ways. However,
all definitions share a common point which is best stated by
Taube [Ref. 1] as: "The right information made available to the right
person at the right time." Bourne [Ref. 2] states that "Information
retrieval has become a generic term, firmly established through
common usage, which includes reference, fact, and document
retrieval." Bourne also differentiates between data processing and
information retrieval. The former includes the manipulation,
replacement, alteration, or addition to the data on file while the
latter is concerned with the storage of data in unaltered form for
later re-use. Use of the term "information retrieval" in this paper
implies the generic meaning stated by Bourne.

This paper is devoted to the investigation of a data structuring
concept proposed by Kildall [Ref. 3] for use in a general-purpose
fact-retrieval system. Before investigating Kildall's proposal in
section VI, the techniques of indexing, storage, and retrieval
established for Library Science purposes will be reviewed. These basic
techniques form a foundation for the design of specific IR systems.


Information retrieval is divided into three major operatives: 


1. Indexing (classification, description, and structuring of 
information sources). 


2. Storage (organization and storage of files). 
3. Retrieval (searching and displaying information). 

Figure 1 is a simplified diagram which illustrates a typical 
information retrieval process. An index is constructed which de- 
scribes the information source (document or record) and is stored in 
a file along with the source itself. A request for information (query) 
is directed to the index file where the location of the requested docu- 
ment within the information file is found. A search of the information 
file then results in the retrieval of the document. This process is 
analogous to the indexing and storing of new books received in a
library, and the search for information by a library patron.

Figure 1

Basic Flow Diagram of the Information Retrieval Process
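
The flow just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the locations 101 and 102, the sample titles, and the function name `retrieve` are all hypothetical and do not come from the paper.

```python
# Hypothetical information file: storage location -> information source.
information_file = {
    101: "Principles of Automated Information Retrieval",
    102: "Treatment of Data in Statistical Mathematics",
}

# Hypothetical index file: descriptor -> locations of the sources it describes.
index_file = {
    "information": [101],
    "retrieval": [101],
    "statistics": [102],
}


def retrieve(query_term):
    """Direct the query to the index file, then fetch from the information file."""
    locations = index_file.get(query_term.lower(), [])
    return [information_file[location] for location in locations]


print(retrieve("Retrieval"))
```

A query term that has no entry in the index file simply yields an empty list, which corresponds to informing the user that nothing was found.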

II. INDEXING


Indexing is the classification, description, and structuring of
information in such a manner that retrieval of the information is
accomplished expeditiously. This task is performed on information
sources such as books, documents, and files and is an integral part
of the information retrieval process. Since retrieval is the
counterpart of indexing, the indexing and retrieval schemes used in
an IR system must be compatible in order for a user to communicate
with the system. Clearly, retrieval efficiency (i.e., ease and speed
of retrieving desired information with a minimum of false drops¹) is
related to the efficiency and consistency of the indexing process.

As a rule, the information base of an IR system is specialized
and as such requires a professional jargon. Ideally, the indexer and
system user are experts in this professional language. However, this
is not always the case, which poses a problem commonly confronted by
IR system designers: how to structure specialized data for input to
the system in a manner that is convenient to both the indexer and
the user while maintaining data accessibility. An example of an
indexing language is the Dewey Decimal System used for indexing
library books.

Selection of an indexing language is based upon the following
considerations:

1. The language should be convenient to use, such as natural
language or a language that could be easily learned.

2. Computerized systems require that the language be rigid
enough to be usable in the machine, yet it must remain convenient
for human use.

3. The vocabulary should be broad enough to allow accurate
description of the information.

4. The language should be flexible enough to allow modification
as changes in information occur.

¹ Output of irrelevant information as a result of a retrieval
request is called a "false drop."


There are numerous indexing languages in use today, each
tailored to suit the specific usage of an IR system. Indexing
languages therefore normally reflect the viewpoint of the system
designer in his attempt to organize the system's data base to best
suit the needs of the user. Several indexing techniques which evolved
from Library Science will be reviewed in the sections that follow.
These techniques appear to form the nucleus from which specialized
systems are formed. Although the techniques are primarily oriented
toward document indexing, variations are used in all types of IR
systems. The techniques are presented in ascending order of:

1. Effort on the part of the indexer.

2. Difficulty in automating.

3. Indexing power.

4. Retrieval efficiency.


A. UNIT-TERM INDEXING

The simplest indexing technique involves the extraction of
descriptive words from the information source. The source is then
associated with each of the terms used to describe its content. In
the case of a library book, or other document, descriptive words
may be taken from the title, abstract, or the text itself. This
technique requires a minimum of effort (other than reading the
source) on the part of the indexer. In addition, the indexing is
accomplished rather quickly since the indexer need not be intimately
familiar with the subject material. Unit-term indexing is
particularly advantageous when no information is available on the
spread of subject material. The addition of new material to the data
base is easily accomplished by expanding the vocabulary (unit-terms)
to include new descriptive words. However, unit-term indexing lacks
rules for combining terms into units which have meaning. This
shortcoming causes indexing problems when synonyms, plural word
forms, and generically related terms are encountered in the source
document.

The search device used in such a system is an alphabetical
listing (indexing record) of the key words used by the indexer. In
general, the information source is listed with each key word and
is used as a source descriptor, or the listing may indicate the
location of the source, or both. It is possible that the user will
have difficulty in using this system unless he knows precisely the
topic that he is searching for. An analogy may be drawn to searching
the telephone book for a name when the spelling of the name is not
known. Therefore, this indexing scheme is often utilized in IR
systems where the user is familiar with the professional jargon
contained in the
information sources (e.g., technical libraries).

An excellent example of this subject-indexing² technique is the
Uniterm Coordinate Indexing System which dates back to 1952. The
Uniterm ("unit-term") System includes fifteen rules governing the
indexer's operation, rules for determining key words, methods for
processing word meanings, and cross-referencing techniques. Some
agencies using this system have drafted standard unit-terms (key
words) to be used by indexers. However, this is unnecessary for an
unstructured language since new unit-terms may be added without
perturbing the existing system. An example of an index that might
be constructed from a Uniterm System is shown below. The numbers
below the unit-terms might represent reference serial numbers, or
library call numbers.

ABLATION
     [entries illegible in source]
ADSORPTION
     137  459  823  1201
ADHESIVE
     [entry illegible in source]
AERODYNAMIC
     139  241  242  357  552  1010  1168

² "Subject indexing," "keyword indexing," and "coordinate
indexing" are terms commonly used to describe the technique presented
here.
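
As a minimal sketch of the alphabetical unit-term listing described in this section, the Python below builds such an index and then looks up sources listed under several unit-terms at once. The serial numbers, the unit-terms assigned to each source, and the function names are hypothetical, and the set-intersection step is one common reading of "coordinate" searching rather than a reproduction of the Uniterm rules.

```python
from collections import defaultdict

# Hypothetical sources: serial number -> unit-terms chosen by the indexer.
sources = {
    137: ["adsorption", "charcoal"],
    139: ["aerodynamic", "ablation"],
    241: ["aerodynamic", "heating"],
    459: ["adsorption", "heating"],
}


def build_unit_term_index(sources):
    """Associate each source with every unit-term used to describe it."""
    index = defaultdict(list)
    for serial in sorted(sources):
        for term in sources[serial]:
            index[term].append(serial)
    return dict(sorted(index.items()))  # alphabetical listing of unit-terms


def coordinate(index, *terms):
    """Return the serial numbers listed under all of the given unit-terms."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_unit_term_index(sources)
for term, serials in index.items():
    print(term.upper(), serials)
```

Adding a new source only appends serial numbers to existing entries or creates new entries, which mirrors the observation that the vocabulary can grow without perturbing the existing index.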
B. KEY-WORD-IN-CONTEXT INDEXING

Another very common subject indexing technique is called
"Key-Word-in-Context" (KWIC) indexing³. The indexing power of KWIC
is very slightly greater than the simplest of subject indexing
techniques since the key word is shown in the context of the entire
subject. There are several variations in KWIC format but essentially
it is an alphabetical listing of key words. Whole phrases are
extracted from the source so that a user can easily determine the
role of the key word. The distinguishing feature of KWIC is its
display format, shown in the example below. Let us suppose that the
title of a source document is: "Principles of Automated Information
Retrieval." Assuming that the indexer selects four key words to
describe the source, the KWIC index would appear as:

5135 Principles of AUTOMATED Information Retrieval
iples of Automated INFORMATION Retrieval 5135 Princ
ion Retrieval 5135 PRINCIPLES of Automated Informat
omated Information RETRIEVAL 5135 Principles of Aut

Note that "automated," "information," "principles," and "retrieval"
are individual key words. A user desiring this source document
could find it by using any one of the four key words. Note also that
a user may find this system easier to use than the Uniterm System if
he is unfamiliar with the subject material.
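
The wrap-around display format can be reproduced mechanically. The sketch below is an assumption about how such lines might be generated, not the procedure of any particular KWIC program: the serial number and title are treated as one circular string, and each line is a rotation that places the capitalized key word at a fixed column. The function name `kwic_lines` and the column value 19 are illustrative choices that happen to fit the example title.

```python
def kwic_lines(serial, title, key_words, key_column=19):
    """Rotate "serial title" so that each key word starts at key_column."""
    entry = f"{serial} {title} "
    lines = []
    for word in key_words:
        start = entry.lower().index(word.lower())
        # Capitalize the key word in place, then rotate the circular entry.
        marked = entry[:start] + word.upper() + entry[start + len(word):]
        shift = (start - key_column) % len(marked)
        rotated = marked[shift:] + marked[:shift]
        lines.append((word.lower(), rotated.rstrip()))
    # The finished index is an alphabetical listing by key word.
    return [line for _, line in sorted(lines)]


title = "Principles of Automated Information Retrieval"
for line in kwic_lines(5135, title, ["principles", "automated",
                                     "information", "retrieval"]):
    print(line)
```

Because every line is a rotation of the same string, the key words line up in one column regardless of where they occur in the title.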

³ Also referred to as "permuted" or "permuted title" indexing.

C. THESAURUS

Indexing power may be increased further by determining generic
relationships between key words. The Armed Services Technical
Information Agency (ASTIA) and the Defense Documentation Center (DDC)
have produced thesauri which are alphabetical lists of indexing terms
with related terms and "see" references. These lists are used by
indexers as a means of standardizing their operation. In other words,
indexers describe similar information sources in a consistent
fashion. These thesauri define some hierarchy in key words and are
useful to the user as well as the indexer since they allow the user
to formulate queries with the exact terms used by the indexer. An
example of a thesaurus entry borrowed from Meadow [Ref. 4] is
exhibited below.

COMPUTERS
(Computers and Data Systems)
     Includes:
          Calculating machines
     Generic to:
          ANALOG COMPUTERS
          ANALOG-DIGITAL COMPUTERS
          BOMBING COMPUTERS
          . . .
     Also see:
          DATA PROCESSING SYSTEMS
          SIMULATION

Computing gun sights use GUN SIGHTS

D. HIERARCHICAL CLASSIFICATION 

Probably the most widely used indexing technique is that of 
hierarchical classification where a universe of information is repeated- 
ly divided and sub-divided into a classificatory tree. This index 
language has a very tightly controlled but simple vocabulary contained 
in an authority list of key words provided with the classification system. 
Each key word in the authority list is assigned a numeric or alphanumeric 
code (mnemonic codes could be used but normally are not). As can be 
seen in the tree structure exhibited below, a key word contains all 
those key words generic to it (i.e., above it in the branch of the tree 
from which it was derived). Hierarchical schemes allow the indexer 
to describe an information source in generic levels so that the user 
may formulate his query in more general or more specific terms by 
moving up or down the classification tree. 

Modification of key word meaning is difficult to accomplish
since changing one word in the tree affects all key words generic to
it. However, changes at the bottom of the tree are easily made since
no perturbation of the tree occurs. Expansion of the vocabulary used
in this system is readily accomplished by expanding the tree
horizontally.

The most well known hierarchical systems are the Dewey
Decimal Classification System (exhibited below), the Library of
Congress System, and the Universal Decimal Classification System.


[Tree diagram: successive subdivision of the classification tree
down to class 519.92]

500     Pure Science
510     Mathematics
519     Probabilities and Statistical Mathematics
519.9   Treatment of Data
519.92  Programming (linear and dynamic)
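
Moving up or down the classification tree can be illustrated with the Dewey classes 500 through 519.92. In this sketch the class names are taken from the example, while the explicit parent links and the function names are assumptions made for illustration.

```python
# Authority list: code -> (key word, code of the class immediately above it).
authority = {
    "500": ("Pure Science", None),
    "510": ("Mathematics", "500"),
    "519": ("Probabilities and Statistical Mathematics", "510"),
    "519.9": ("Treatment of Data", "519"),
    "519.92": ("Programming (linear and dynamic)", "519.9"),
}


def generic_chain(code):
    """List the classes from the given code up to the root of the tree."""
    chain = []
    while code is not None:
        name, parent = authority[code]
        chain.append((code, name))
        code = parent
    return chain


def specific_classes(code):
    """List the classes immediately below the given code."""
    return sorted(c for c, (_, parent) in authority.items() if parent == code)


for code, name in generic_chain("519.92"):
    print(code, name)
```

A query stated at class 519 can be broadened by following `generic_chain` upward or narrowed by following `specific_classes` downward.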


E. FACETED INDEXING

In the immediately preceding section a classification technique
was presented which structures a topic (universe of information) by
dividing and subdividing it to form a classificatory tree. Faceted
indexing deals with individual key words taken from the data source
and grouped into categories with respect to their usage within the
source. Terms within each group are structured into a classificatory
tree. A term extracted from the source is analyzed from several
points of view and a group of indexing terms is synthesized to
describe the key word in context. This technique is referred to as
"facet analysis," "faceted indexing," and "relational indexing,"
where each point-of-view analysis of a key word is called a facet.

An excellent example of faceted indexing is given by Meadow
[Ref. 4]. Let us suppose that "steel" is a key word taken from a
source document. The document contains information relating to the
manufacture, use, chemical analysis, and properties of steel. By
appending descriptors to the key word "steel," index terms such as
the following are created:

STEEL, manufacture of

STEEL, use in automobiles

These index terms are not predefined in any authority list but
are constructed by the indexer by appending descriptors to the key
word. The terms follow some syntactic rule such as: subject followed
by modifier, followed by operation modifier. The utility of this
technique is that the indexer, armed with a descriptor list and
syntactic rules tailored to suit the particular IR system, may
analyze a source from many points of view and construct index terms
that describe the information content in great detail.


F. AUTOMATIC INDEXING

In the foregoing discussions, it was assumed that the indexer
was human. A treatment of automatic (computer) indexing is now in
order.

Automatic indexing is difficult to accomplish for two main rea- 
sons. First, the information source must be in machine readable 
form. In the case of books or other lengthy documents this is a very 
expensive requirement. However, development of character recog- 
nition devices and the production of transcripts in machine code as a 
by-product of automatic typesetting have eased the cost of this require- 
ment. The second problem, and the more serious, is the development 
of algorithms or heuristics which derive meaning from strings of 
characters. This is an area of Artificial Intelligence in which a good 
deal of research has been expended. However, the results of this 
research have been empirical since we lack sophisticated linguistic 
and semantic knowledge. References 5, 6, 7, and 8 contain excellent
treatments of the research conducted and problems involved in machine
translation of natural language, while Reference 9 contains a
comparison of manual and automatic indexing techniques.

There is an automatic indexing technique in commercial use
today; however, it is a "brute force" adaptation of KWIC. Basically,
the technique produces index key words by comparing words from the
source to words stored in an authority list. There are many
limitations to this system, such as the correct handling of
hyphenated words, plural forms, and proper nouns, but the primary
limitation is that the list must contain a sufficient number of
appropriate words in order for a source to be adequately indexed.
The demands that such a system places on size, speed, and complexity
should be obvious.
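
The authority-list comparison can be sketched briefly. The four-word authority list, the tokenizing pattern, and the function name `auto_index` below are hypothetical; a working list would have to be far larger, which is exactly the limitation noted above.

```python
import re

# Hypothetical authority list; a practical one would be far larger.
AUTHORITY_LIST = {"automated", "information", "principles", "retrieval"}


def auto_index(text, authority=AUTHORITY_LIST):
    """Return the words of the source that also appear in the authority list."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({word for word in words if word in authority})


print(auto_index("Principles of Automated Information Retrieval"))
print(auto_index("Retrievals of data"))  # the plural form is not matched
```

The second call returns no key words because "retrievals" is not in the list, a small demonstration of the plural-form problem.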

Referring to Figure 1, it is seen that the indexing process produces
index records. The contents of the records vary widely and are
dependent upon the type of IR system (e.g., document, fact, or
reference). In addition to subject descriptors, the index may contain
the location of the information, source, author, reference to another
index record, or other information deemed pertinent by the system
designer. It will also be noted from the figure that the information
source, or information concerning the source, will also be stored in
the IR system. A large document such as a book probably will not be
stored in the computer; rather, a reference or abstract will be
stored as a substitute. In some cases, the index record itself will
contain all of the information associated with an information source.
For example, an index record for a library book may contain the
book's location within the library; the system will therefore present
the index record itself in answer to a user's query.



III. STORAGE


This section of the paper contains descriptions of various
techniques used for organizing index and information files within an
IR system's storage media. There will be no discussion of storage
devices since it is assumed that the reader is already familiar with
computer equipment. The reader is aware, of course, that the
system's capacity, cost, and response time are greatly affected by
the selection of various storage media.


A. FILE ORGANIZATION

Organization of an index file or information file specifies the
positioning of the records in relation to one another within the file
along with the physical position of the file within the storage media.
Choice of a rule which governs file organization is dependent upon
desired response time, peak retrieval loads, system reliability⁴,
category of users, cost, rate of information change, rate of system
growth, and type of storage media. There are several rules for file
organization which are extensively used in IR systems and they are
presented here. These rules are equally applicable to index and
information files.

1. Sequential Organization

The first method involves the sequential placement of records
within a file. The (i+1)th record follows (physically and/or
logically) the ith record. Examples are the alphabetical listing of
subject-indexing key words and the alphabetical arrangement of
employee records. This method is very conservative of memory space
since there is no need to supply pointers or links to indicate where
the next record in the file is located. On the other hand, additions
or deletions to the file are difficult to make. Let us suppose that
we desire to add a new name to the telephone book. Then all of the
names which follow the inserted name must be moved. Likewise, the
deletion of a name results in perturbation of the list. This type of
organization is most commonly used with magnetic tape where records
are searched sequentially.

⁴ Ability to retrieve a maximum of information with a minimum of
false drops.

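
The cost of an addition to a sequentially organized file can be made concrete with a short sketch. The names and the helper `insert_sequential` are hypothetical; a Python list stands in for the physically contiguous file.

```python
import bisect


def insert_sequential(records, name):
    """Insert name in sorted order; return how many later records had to move."""
    position = bisect.bisect_left(records, name)
    records.insert(position, name)  # every record after position shifts by one
    return len(records) - 1 - position


book = ["Adams", "Doe", "Jones", "Smith"]
moved = insert_sequential(book, "Baker")
print(book, moved)
```

Inserting near the front of a long file moves almost every record, which is why this organization suits files that change rarely.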
2. Chaining

Another technique of file organization is called "chaining,"
where addresses (links, chains, or pointers) are stored in one or
more fields of a record to indicate the location of the next record
within the file. Recall from the discussion of indexing that thesauri
contain "see" references. These references are links which convey
the idea of chaining. Chaining is a particularly effective method when
used in a crowded memory since "referred to" records may be placed
in any available space within the memory (unlike the rigid sequential
scheme). Also, the utility of chaining is fully realized in a system
which experiences a high rate of information change. This method
requires more memory space than the sequential scheme since extra
fields must be appended to the records to accommodate the links.
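
A minimal sketch of chaining follows. Memory is modeled as a dictionary from address to record; the addresses, names, and helper functions are hypothetical and serve only to show that an addition touches one link rather than moving records.

```python
# Hypothetical memory: address -> record with a link to the next record.
memory = {
    40: {"name": "Adams", "next": 12},
    12: {"name": "Doe", "next": 77},
    77: {"name": "Smith", "next": None},
}
head = 40


def walk(memory, head):
    """Follow the links from the head record and collect the names in order."""
    names, address = [], head
    while address is not None:
        record = memory[address]
        names.append(record["name"])
        address = record["next"]
    return names


def insert_after(memory, address, new_address, name):
    """Place a record in any free slot and relink; no other record moves."""
    memory[new_address] = {"name": name, "next": memory[address]["next"]}
    memory[address]["next"] = new_address


insert_after(memory, 12, 5, "Jones")
print(walk(memory, head))
```

The extra `next` field in every record is the additional memory cost mentioned above.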

3. Branching

An extension of the chaining technique is referred to as a
"branching structure." Branching is used to achieve versatility in
changing record entries, changing file structures, and conversion,
where possible, of variable-length records to fixed-length records.
A trivial example is shown in Figure 2, which exhibits the idea of
branching file structures.

Let us suppose that our file consists of all military flying clubs
in the United States. Each record consists of the club's name, address
(airport, city, state), membership, and type of aircraft. Obviously,
these records are variable-length because the number of aircraft
owned by each club is variable. The main file may be converted to
fixed-length records by replacing the aircraft type fields with a
single address. The aircraft types could then be included in another
fixed-length file. The address in the main record links to an address
file which in turn points to the file containing the aircraft types.
Repetition of aircraft type is eliminated from the main records, main
records are fixed-length, and changes are made only to the address
file, not to the main file or aircraft file.
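
The flying-club arrangement can be sketched as three small files. All addresses, club names, and aircraft assignments below are hypothetical stand-ins; only the three-file layout follows the description above.

```python
# Hypothetical aircraft file: address -> aircraft type (each type stored once).
aircraft_file = {210: "CESSNA 180", 215: "CHEROKEE", 220: "PT-19"}

# Hypothetical address file: address -> aircraft-file addresses for one club.
address_file = {100: [210, 215], 101: [210, 220]}

# Main file: fixed-length records holding a single address for the aircraft.
main_file = [
    {"name": "MONTEREY", "airport": "NALF", "aircraft": 100},
    {"name": "NEEDLES", "airport": "ABC", "aircraft": 101},
]


def aircraft_types(record):
    """Follow the main record's address through the address file."""
    return [aircraft_file[a] for a in address_file[record["aircraft"]]]


# A club acquiring another aircraft changes only the address file.
address_file[100].append(220)
print(aircraft_types(main_file[0]))
```

The main record keeps the same length and content after the change, since it still holds only the address 100.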

Figure 3 exhibits another feature of this technique which replaces
all field entries in the main file (except the name) with addresses. If
it is later decided to add "county" to "city" and "state," then no
changes are required in the main file, but a field must be added to
each of the "city-state" file records to absorb the new addition.

[Figure 2: diagram of a MAIN FILE whose records hold NAME, AIRPORT,
ADDRESS, AIRCRAFT, and NEXT RECORD fields. The AIRCRAFT field holds
an address into an ADDRESS FILE, whose entries point in turn to an
AIRCRAFT FILE of aircraft types (e.g., CESSNA 210, CESSNA 180,
CHEROKEE, PT-19).]

Conversion of Variable-Length Records to Fixed-Length Records
using the Branching Technique.

Figure 2

[Figure 3: diagram of a MAIN FILE record (fields NAME, AIRPORT,
ADDRESS, AIRCRAFT, NEXT RECORD) whose ADDRESS field links to a
CITY/STATE FILE.]

CITY/STATE FILE BEFORE ADDITION OF COUNTY

498 | MONTEREY, CA.
513 | NEEDLES, CA.

CITY/STATE FILE AFTER ADDITION OF COUNTY

498 | MONTEREY, CA. | MONTEREY
513 | NEEDLES, CA.  | XYZ

Figure 3

Addition of Records to an Existing Branching Structure



4. List Structuring

Although chaining and branching allow records to be scattered
throughout memory, their membership in a particular file is
maintained by some order of relative placement (e.g., employee records
logically linked in alphabetical order but physically scattered
throughout the file). List structuring does not require that records
be ordered in any specific manner within a file. Further, the fields
of a record may be physically separated and then linked to form a
logical record. The advantage of this form of storage is the freedom
of changing field content structure, record content, and file
structure. However, this method requires a great deal more memory
space than any other technique. In addition, the retrieval process is
relatively slow since more time is required to gather the elements of
a record together.

The three techniques of file organization described above are all
forms of list structuring and each demonstrates a different degree of
structural freedom. Chaining requires that fields remain contiguous,
but records, while remaining ordered, may be physically separated.
Branching is an extension of chaining allowing fields to contain
address linkages to other fields. The last method allows any ordering
and structuring of fields and records.


B. FILE SEQUENCING

It is important that records be sequenced (sorted) in some
manner for use in IR systems. Sequencing is normally based on
some particular attribute of a record (called a sort key) such as the
"name" field of an employee record. Selection of the sort key is
based on many considerations but the objective is to select the same
sort key as may be used in a retrieval request. Subordinate sort keys
may also be chosen when more than one record has the same primary
sort key value (e.g., several employees with the same last name).
Searching records which are ordered on the primary sort key is then
called an "ordered search."
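
Primary and subordinate sort keys can be illustrated with a few employee records. The records and field names are hypothetical; the tuple key simply expresses "last name first, then first name".

```python
# Hypothetical employee records.
employees = [
    {"last": "Smith", "first": "Ann"},
    {"last": "Doe", "first": "John"},
    {"last": "Smith", "first": "Al"},
]

# Primary sort key: last name. Subordinate sort key: first name.
sequenced = sorted(employees, key=lambda r: (r["last"], r["first"]))

for record in sequenced:
    print(record["last"], record["first"])
```

The subordinate key only matters for the two Smith records, whose primary key values are equal.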

IV. RETRIEVAL


The retrieval process essentially consists of searching the index
files and information files for information which satisfies a user's
query. If the information is found, it is sent to the user; if not,
the user is so informed. It should be noted that "searching" and
"retrieval" are not synonymous. "Searching" is a file access
operation used to locate records for matching against the query,
while "retrieval" is the actual output of information which satisfies
the query. However, use of the word "retrieval" here will imply the
entire operation of searching and retrieval.

As previously discussed in section II, indexing and retrieval are
counterparts since indexing refers to the structure of information for
input to the files, while retrieval is the process of locating and
displaying desired information. Therefore, the query language employed
by the system user must be compatible with the index language
employed by the system designer. It is important that the query and
index languages use the same vocabulary in order for the IR system
to understand the user's requests. The user must also be familiar
with the system's logic in order to formulate an intelligent query. He
must know if the system honors the use of Boolean relationships
("and," "or," "not") and magnitude comparators ("greater than,"
"less than," etc.) as query terms.

Once the query is formulated it is input to the system's index
file. A matching process takes place at the index file where the terms 
used in the query are matched against the index file records. Index 
records which match the terms of the query are employed as locators 
to direct the retrieval of data from the information file. 
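
A hedged sketch of the matching step with Boolean relationships follows. The index contents, the nested-tuple query form, and the reading of "not" as "matches the first operand but not the second" are all assumptions introduced for illustration.

```python
# Hypothetical index file: descriptor -> set of information-file locators.
index_file = {
    "steel": {1, 2, 3},
    "manufacture": {2, 4},
    "automobiles": {3},
}


def match(query):
    """Evaluate a query given as a term or as (operator, left, right)."""
    if isinstance(query, str):
        return set(index_file.get(query, set()))
    operator, left, right = query
    if operator == "and":
        return match(left) & match(right)
    if operator == "or":
        return match(left) | match(right)
    if operator == "not":  # locators matching left but not right
        return match(left) - match(right)
    raise ValueError(f"unknown operator: {operator}")


print(match(("and", "steel", "manufacture")))
print(match(("not", "steel", ("or", "manufacture", "automobiles"))))
```

The resulting sets are the locators that would then direct retrieval from the information file.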

The technique used in searching the index and information files 
is governed by the file organization (structure, sequencing, content, 
and storage medium). In the ensuing discussion of search techniques 
it should be borne in mind that whatever technique is used it is fixed 
within the IR system. Also, the interrelationship between search plan
and file organization may limit file accessibility and search
flexibility.


A. FULL-FILE SEARCH

One search plan incorporates a full-file search where every
record of the file is matched (e.g., the value of the query term is
matched against the value of the sort key). This plan is used when
the order of records within a file is unknown (e.g., a file of employee
records that are not alphabetically sorted). In this case, if we were
searching for Doe's record and found Smith's, it does not follow that
we have searched too far, since the records are not collated. In
addition, there may not be any assurance that a single match satisfies
the search (there may be more than one Doe in the file). Therefore,
all records within a file must be searched.

B. SEQUENTIAL SEARCH

A sequential search plan might be used when the records are not
only sequenced but sequenced on the same term as is used in the query.
Sequential searches are normally used in conjunction with sequential
access type storage devices. The records of a file are matched
sequentially until a successful match is made or until the value of
the sort key exceeds the value of the query term. In this case,
searching for Doe's record and locating Smith's record indicates not
only that the search has gone too far but also that no successful
retrieval will be made, since there is no Doe in the file.
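
The early stopping rule can be sketched as follows. The list of names and the function name are hypothetical, and the function also reports how many records were examined so that the saving over a full-file search is visible.

```python
def sequential_search(sort_keys, query):
    """Search keys sorted in ascending order; stop once a key exceeds the query.

    Returns (matching keys, number of records examined).
    """
    matches = []
    for examined, key in enumerate(sort_keys, start=1):
        if key == query:
            matches.append(key)
        elif key > query:
            return matches, examined  # passed the place where query belongs
    return matches, len(sort_keys)


names = ["Adams", "Baker", "Jones", "Smith", "Young"]
print(sequential_search(names, "Doe"))
print(sequential_search(names, "Smith"))
```

Searching for "Doe" stops at "Jones", the first key that exceeds the query, without examining the rest of the file.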


C. BINARY SEARCH

A binary search plan may also be used with a sequenced file.
The term "binary" implies that a two-valued decision is made after
every match attempt. The search begins in the middle of the file. If
the first match attempt is unsuccessful then the next attempt is made
one-quarter file length away from the first. The direction of the
subsequent search is dependent upon the result of comparing the value
of the query term and the sort key (e.g., if the sort key is greater
than the query term then move one-quarter file toward the beginning of
the file). Each successive move is then made one-half the length of
the preceding move. If there are n records in the file then there
will be approximately log₂ n moves to exhaust the file.
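
A standard binary search, written here as a sketch with a counter for match attempts, shows the logarithmic bound. The file of 1000 even-numbered keys and the function name are hypothetical.

```python
import math


def binary_search(sort_keys, query):
    """Return (position or None, number of match attempts) on sorted keys."""
    low, high, attempts = 0, len(sort_keys) - 1, 0
    while low <= high:
        middle = (low + high) // 2
        attempts += 1
        if sort_keys[middle] == query:
            return middle, attempts
        if sort_keys[middle] > query:
            high = middle - 1  # move toward the beginning of the file
        else:
            low = middle + 1   # move toward the end of the file
    return None, attempts


keys = list(range(0, 2000, 2))  # 1000 sorted keys
position, attempts = binary_search(keys, 1998)
print(position, attempts, math.floor(math.log2(len(keys))) + 1)
```

For n records the number of attempts never exceeds floor(log₂ n) + 1, which is 10 for this file of 1000 keys.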

D. DIRECT ACCESS SEARCHING

The last file searching technique relies upon a special type of
index file called an inverted index. This is probably the most common
type of index file used in IR systems. The inverted file records
consist of the descriptors produced during the indexing process. The
descriptors are used as sort keys for sequencing the records within
the index. Appended to each descriptor field are fields which contain
addresses of the associated records in the information file. Some
type of search plan is conducted (usually binary) for matching
descriptors (which are sort keys) to the query term. When a
successful match is achieved, the addresses of the appropriate
information records are obtained and the records are directly
retrieved.
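
The inverted index lookup can be sketched with Python's standard `bisect` module standing in for the binary search plan. The descriptors, addresses, and record contents are hypothetical.

```python
import bisect

# Hypothetical information file: address -> record.
information_file = {
    498: "record on steel manufacture",
    513: "record on steel use in automobiles",
    520: "record on aluminum alloys",
}

# Inverted index: (descriptor, addresses), sequenced on the descriptor.
inverted_index = [
    ("aluminum", [520]),
    ("automobiles", [513]),
    ("manufacture", [498]),
    ("steel", [498, 513]),
]
descriptors = [descriptor for descriptor, _ in inverted_index]


def direct_access(query_term):
    """Binary-search the descriptors, then fetch records by address."""
    i = bisect.bisect_left(descriptors, query_term)
    if i == len(descriptors) or descriptors[i] != query_term:
        return []
    return [information_file[address] for address in inverted_index[i][1]]


print(direct_access("steel"))
```

Only the index is searched; the information file is touched solely at the addresses the matching index record supplies.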


E. COMBINED SEARCH PLANS 

The above treatment of search plans demonstrates that the
techniques are dependent upon file organization, but plans may be
combined in one IR system. For example, a binary search may be
employed in the index file to locate the disk and/or track which
contains the desired information while a sequential search is made of
the track for the requested records.

V. RETRIEVAL SYSTEMS


This section of the paper contains a discussion of the primary 
differences between reference, document, and fact retrieval in order 
to provide a frame of reference for the development of a fact-retrieval 
system. Reference retrieval is treated first since it is the least 
complicated of the three types of information retrieval. 

Queries used in a reference-retrieval system contain only the 
topic for which information is desired (e.g., STEEL). The material 
provided to the requestor is a list of references pertaining to his topic. 

Document retrieval queries are narrower in scope since
descriptive terms are used to modify the topic (e.g., STEEL, chemical
properties of). Documents are provided to the requestor which contain
the desired information.

Fact-retrieval systems are the most complicated and powerful
of all since they are capable of providing specific answers to specific
questions.


A. REFERENCE RETRIEVAL

Reference retrieval is the first step taken by one in search of
specific information. As explained above, a reference-retrieval
system provides a user with a bibliography pertaining to the topic for
which specific information is sought. The second step in the search
for information is totally unrelated to the reference-retrieval system.
The user must examine the documents listed in the bibliography in
order to obtain the desired information. It is clear that in the first
step the user's search for information is narrowed from a search of
the entire "library" to a "shelf" in the library.


B. DOCUMENT RETRIEVAL

The definition of document retrieval is not straightforward.
One point of view holds document retrieval to be the second step of
reference retrieval. Another holds it to be a special case of
fact retrieval. What this author regards as document retrieval may
be fact retrieval to another. The definition upheld by this author is
the retrieval of unprocessed text, word for word, as it is stored in the
information file. An example would be requesting a specific report
from a technical library.


C. FACT RETRIEVAL

Fact retrieval ranges from the retrieval of processed text
stored in an information file to the retrieval of specific answers to
specific questions. The more powerful end of the spectrum is referred
to as "question answering." Reference 10 contains an excellent
treatment of the general characterizations, limitations, capabilities,
and feasibility of the question-answering type of fact-retrieval systems.
Reference 11 contains a practical example of a question-answering
program.

Confusion arises at the low end of the fact-retrieval spectrum
where it is difficult to distinguish between document
and fact retrieval. One point should help clarify the difference.
Document-retrieval systems possess only rote memory, which means
that their capability is limited to the display of information word for
word as it is stored in the data base. Fact-retrieval systems possess
the capability of manipulating data stored in the data base into a form
which best satisfies the user's request.


VI. DATA STRUCTURE FOR A FACT-RETRIEVAL SYSTEM


This section contains the description of a data-structuring
technique proposed by Kildall [Ref. 3] for use in a general-purpose
fact-retrieval system. Specific usage of the system depends in part upon
the type of information stored in its files. However, the nature of the
system is the processing of data to provide a user with specific answers
to his queries. Therefore, the system approaches "question answering."
The data-structuring technique employs the basic concept of hierarchical
classification, which divides a topic (also referred to as a universe
of discourse) into its class structure and correlates the data elements
of the information file to a tree-type classificatory structure.

A treatment of the retrieval process is also provided here since 
the query format is directly related to the data-structuring technique. 

This section is expressly devoted to a discussion of the data- 
structuring concept while section VII contains the description of the 
general-purpose fact-retrieval system which employs the proposed 
technique. The system was designed for the primary purpose of in- 
vestigating the potential of the data-structure concept and not for 
production purposes. 

As previously discussed, fact-retrieval systems range from the
manipulation of processed text to "question answering." The system
described herein maintains a position in the middle of this continuum.
The term "general purpose" used here does not necessarily mean that
the system may be utilized throughout the full range of fact retrieval.
Rather, it means that the system will accommodate files which contain
different types of information.


A. DATA STRUCTURE 

The structure employed for indexing data incorporates the concept
of hierarchical classification, which allows the user to enter the
data base in a number of ways in order to extract desired information.
A universe of discourse is structured in terms of "classes" and a
hierarchy of classes is established onto which the associated data
elements are mapped. For example, assume that a universe of discourse
consists of personnel records. The records consist of names,
addresses, and telephone numbers which are members of the classes
"NAME," "ADDRESS," and "TELEPHONE NUMBER." "NAME" is
further divided into the subclasses "LAST," "FIRST," and "MIDDLE"
while "ADDRESS" contains "STREET," "CITY," and "STATE."
The data structure is then represented by a classificatory tree with
the data elements related to the classes contained in the tree. The
data element "DOE," for example, is identified as a member of the
class "LAST," and the class "LAST" is a member of "PERSONNEL
RECORD." All data elements of a structure are identified in this
fashion.


1. Class Structure Representation 


Class structures are represented by parenthesized expressions
which are used to define the structure of the classificatory tree.
The technique of employing parentheses to define structures is similar
to that employed in LISP S-expressions [Ref. 12]. Punctuation
symbols used in the expressions are the left parenthesis, the
right parenthesis, and the comma. The parentheses are used to
enclose those classes which are directly related to a superclass while
the comma is used to separate the classes within the parenthesized
unit. Units within an expression are separated by commas and the
entire expression itself is enclosed by parentheses. As demonstrated
in the preceding section, "PERSONNEL RECORD" consists of the
classes "NAME," "ADDRESS," and "TELEPHONE NUMBER." This
definition is called the format definition and is the foundation for the
construction of the classificatory tree. Format definitions are
represented by the parenthesized expression shown below.

PERSONNEL RECORD (NAME, ADDRESS, TELEPHONE NUMBER)

"NAME" and "ADDRESS" were further divided into subclasses
and the expressions below show the parenthesized forms for "class
definitions."

NAME (LAST, FIRST, MIDDLE)
ADDRESS (STREET, CITY, STATE)

Subclasses may also be subdivided and this process is replicated
to fully define the class structure of the universe of discourse. Figure
4 graphically demonstrates the class structuring process, the fully
parenthesized expression for the class structure, and the associated
classificatory tree. Although the above example does not include a
subdivision for the class "STREET," one is shown in the tree structure
to demonstrate a third level of class replication.


Figure 4

Parenthesized Class Expressions and Associated Tree
Structure for the Hierarchical Classification of Data.


2. Data Representation 


Once the class structure is defined, the associated data may
be mapped directly onto the structure. Data representation is identical
to the class expression as shown below.

((DOE, JOHN, JAMES), (203 ELM STREET, MONTEREY, CA.), 384-9363)
                |
((LAST, FIRST, MIDDLE), (STREET, CITY, STATE), (TELEPHONE NO.))
                |
NAME, ADDRESS, TELEPHONE NO.
EMPLOYEE RECORD

Representation of repeated data elements within the record is
easily handled by properly parenthesizing the record. For example,
two phone numbers for John Doe would be represented by:

((DOE, JOHN, JAMES), (203 ELM STREET, MONTEREY, CA.),
(384-9363, 384-6214))

The class membership of each data element in the record is
clearly defined by the parenthesized expression.
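
The mapping of data onto the class structure can be sketched in
Python as follows. This is an illustrative sketch only: the function
names parse_expression and label are hypothetical, the parser assumes
well-formed input, and nested Python lists stand in for the
classificatory tree.

```python
def parse_expression(text):
    """Parse '((A, B), C)' into nested lists: [['A', 'B'], 'C']."""
    pos = 0

    def flush(items, token):
        # Close the element collected so far, ignoring padding blanks.
        name = "".join(token).strip()
        if name:
            items.append(name)
        token.clear()

    def parse_unit():
        nonlocal pos
        items, token = [], []
        pos += 1  # skip the left parenthesis
        while text[pos] != ")":
            ch = text[pos]
            if ch == "(":
                items.append(parse_unit())  # nested parenthesized unit
            elif ch == ",":
                flush(items, token)
                pos += 1
            else:
                token.append(ch)
                pos += 1
        flush(items, token)
        pos += 1  # skip the right parenthesis
        return items

    return parse_unit()

def label(class_tree, data):
    """Pair each data element with the class in the same position."""
    pairs = []
    for cls, value in zip(class_tree, data):
        if isinstance(cls, list):
            pairs.extend(label(cls, value))
        elif isinstance(value, list):
            pairs.extend((cls, v) for v in value)  # repeated elements
        else:
            pairs.append((cls, value))
    return pairs

classes = parse_expression(
    "((LAST, FIRST, MIDDLE), (STREET, CITY, STATE), TELEPHONE NUMBER)")
record = parse_expression(
    "((DOE, JOHN, JAMES), (203 ELM STREET, MONTEREY, CA.), "
    "(384-9363, 384-6214))")
for class_name, element in label(classes, record):
    print(class_name, "=", element)
```

Because the record and the class expression share one shape, walking
the two in step is sufficient to recover the class membership of every
element, including both telephone numbers.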


3. System Utility


The utility of hierarchical classification in association with
parenthesized expressions is realized by the user in three ways:

1. The indexing techniques presented in section II require the
user to conform to the language devised by the system designer for 
the retrieval of information. The user does not have the option of 
defining the indexing language that best suits his particular needs but 
must be satisfied with the indexing technique employed to best satisfy 
the needs of all users. In contrast, this system allows each user or 
user group to define his own indexing language by defining the class 
structure associated with the data he is most concerned with. In other 
words, the system will accept a mix of data allowing each user or user 
group to have his own retrieval system within a retrieval system. 
Each user or user group must define the class structure of his data. 
For example, a business-oriented system might consist of a data base 
partitioned into employee records, pay records, stock inventory, etc. 
Such a system would simultaneously serve the needs of many users. 


2. The user has the capability of entering the data structure in 
several ways to extract desired information. In the personnel record 
example, the user may retrieve complete records which satisfy cer- 
tain search keys, or retrieve only the names of personnel, or retrieve 
the phone number of a particular person, and so on. 

3. The classification scheme could serve as an intermediate 


language between the query processor and the retrieval system. 


B. RETRIEVAL PROCESS 
1. Query Format 

Queries are presented to the system utilizing the same format
as class expressions. The fully parenthesized expression contains
search keys and blank positions which specify the information to be
supplied to the user. The retrieval processor will fill in the blank
positions with all of the information contained in the data base which
satisfies the search keys. The expression must conform identically to
the fully parenthesized expression used to represent the class structure.

((DOE, JOHN, --), (--, --, --), --)

In the example above, the system will identify the class membership
of each search key and blank position through the classificatory
tree constructed from the class expression. A search is then instituted
for all records which contain an occurrence of "DOE" as a member of
the class "LAST" and "JOHN" as a member of the class "FIRST."
Information is extracted from those appropriate records to fill the
blank positions of the query. The user may broaden or narrow the
amount of information retrieved by the number and/or class of search
keys used in the query. A query containing only the search key
"CALIFORNIA" could produce a greater amount of information than a
query which has only one blank position.
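
The effect of search keys and blank positions can be sketched in
Python. The sketch below is an assumption-laden illustration: it
scans every record rather than using an index, it writes a blank
position as "--", and the sample records other than John Doe's are
invented.

```python
BLANK = "--"  # assumed marker for a blank position in a query

CLASSES = [["LAST", "FIRST", "MIDDLE"],
           ["STREET", "CITY", "STATE"],
           "TELEPHONE NUMBER"]

def flatten(class_tree, values):
    """Yield (class name, element) pairs by walking both trees in step."""
    for cls, value in zip(class_tree, values):
        if isinstance(cls, list):
            yield from flatten(cls, value)
        else:
            yield cls, value

def answer(class_tree, query, records):
    """Fill the blank positions of the query from each matching record."""
    keys = [(c, v) for c, v in flatten(class_tree, query) if v != BLANK]
    wanted = [c for c, v in flatten(class_tree, query) if v == BLANK]
    results = []
    for record in records:
        fields = dict(flatten(class_tree, record))
        if all(fields[c] == v for c, v in keys):
            results.append({c: fields[c] for c in wanted})
    return results

# Hypothetical data base of personnel records.
records = [
    [["DOE", "JOHN", "JAMES"],
     ["203 ELM STREET", "MONTEREY", "CA."], "384-9363"],
    [["DOE", "JANE", "ANN"],
     ["12 OAK AVENUE", "MARINA", "CA."], "384-6214"],
    [["SMITH", "JOHN", "PAUL"],
     ["80 WHITNEY", "SALINAS", "CA."], "372-1100"],
]
query = [["DOE", "JOHN", BLANK], [BLANK, BLANK, BLANK], BLANK]
print(answer(CLASSES, query, records))
```

Adding search keys narrows the set of matching records, while adding
blank positions widens the amount of information returned from each.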

2. Boolean Expressions 

The ability to use Boolean expressions such as "and," "or,"
"not," etc., is desirable in any information retrieval package. However,
the degree to which Boolean expressions may be used is left to
the prerogative of the system designer in satisfying user needs. The
use of Boolean "and" is accepted by the retrieval processor in this
system and is identified by the ampersand:

((--, --, --), (--, MONTEREY & MARINA, CALIFORNIA), --)

In this case, the names, street addresses, and phone numbers
of all personnel who live in Monterey, California and Marina,
California would be produced.

Boolean "or" is not directly used in this system but
its effect is similar to the use of alphabetic and numeric range requests.

3. Alphabetic and Numeric Ranges

Alphabetic and numeric range requests are identified by the
colon. Examples of range requests are exhibited below.

((A:D, --, --), (--, MONTEREY, CALIFORNIA), --)

The retrieval processor identifies an alphabetic range request for all
data elements which are members of the class "LAST" and which have
the first letter A, B, C, or D. The records of all personnel who live
in Monterey, California and whose last names begin with A through D
inclusive would be produced.

As shown immediately above, the system does not restrict the 
use of alphabetic or numeric ranges to single letters but any number of 
characters may be used and any number of range requests are possible 
within a single query. 
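
One way a range request might be tested against a data element is
sketched below in Python. The LOW:HIGH spelling of a request and the
prefix-comparison rule are assumptions chosen to agree with the
description above; the comparison actually used by the retrieval
processor is not specified here.

```python
def in_range(element, request):
    """True if the element's leading characters fall within LOW:HIGH."""
    low, high = request.split(":")
    # Compare only as many leading characters as each bound supplies,
    # so "A:D" accepts every name beginning with A through D inclusive.
    return low <= element[:len(low)] and element[:len(high)] <= high

def matches(element, key):
    """Treat a key containing a colon as a range, otherwise as exact."""
    return in_range(element, key) if ":" in key else element == key

print(matches("DOE", "A:D"))           # a single-letter alphabetic range
print(matches("DAVIS", "DA:DO"))       # a multi-character range
print(matches("384-9363", "372:384"))  # a numeric range on the exchange
```

Bounds of equal length composed of digits compare correctly as
character strings, which is why the same routine serves both
alphabetic and numeric requests in this sketch.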

The above discussion is also true for numeric range requests. 
For example, the user desires complete records for all those personnel 


who have specific telephone exchanges: 


ee ee — , ), 9102394) 
VII. SYSTEM STRUCTURE


This section discusses the internal design of the general-purpose 
fact-retrieval system employing the data-structure technique previously 
explained. The system was implemented on the Naval Postgraduate 
School's IBM 360 Model 67 Computer and is an interactive system under 


control of the Cambridge Monitor System (CP/CMS) [Ref. 13]. 


A. DATA FILES

Data files are stored on punched cards and consist of the following
three types:

1. Format definition cards. These cards define the class
structure for each universe of discourse to be included in the data
base. An example of a format definition card is:

EMPLOYEE RECORD (NAME, ADDRESS, AGE, CHILDREN)

2. Class definition cards. These cards further define the
structure of the classes contained in the format definition. Examples
of class definition cards are:

NAME (LAST, FIRST)
ADDRESS (STREET, CITY, STATE)

3. Data records. The data records contain the data elements
associated with the universe of discourse and are fully parenthesized
expressions. An example of a data record is:

EMPLOYEE RECORD ((DOE, JOHN), (203 ELM STREET, MONTEREY, CA.),
(48), (MARY, SALLY))

Format definitions, class definitions, and data records may also
be entered into the system via on-line terminal. For a large-scale
data base, the data records could be stored in unstructured form on a
back-up storage device such as magnetic tape. Structuring of records
would be accomplished under program control according to pre-stored
format and class definitions.


B. TREE-TYPE DATA STRUCTURES

A tree-type data structure is employed to represent the hierarchical
classification of a universe of discourse. The tree-structuring
process described later in this section employs data cells to represent
nodes within a tree and the "chaining" technique to order the cells into
tree structure form.

1. Data Cells

Data cells available to the tree-structuring processor consist
of three fields. The description and function of each field are given
below:

a. The identifier field, referred to as "TOP," contains the
storage address (pointer) of the data or class entity which the data cell
represents.

b. The right link field, referred to as "RIGHT," contains a
pointer which is used to chain the data cell to another data cell on the
same level of the tree.

c. The down link field, referred to as "DOWN," contains
a pointer which is used to chain the data cell to another data cell
located in a lower level of the tree.

Figure 5 demonstrates the use of data cells. A zero in a link
field signifies "no link" or a null field.

2. Structuring Process

Empty data cells are constructed in core storage through
list structuring techniques and are stored in an area available to the
tree-structuring routine. The presence of a format definition card
initiates the structuring process. The format name (e.g., EMPLOYEE
RECORD) and the class names contained on the card are extracted
and moved into storage (a discussion of this process is deferred to a
later section). A number of cells equal to one (for the format name)
plus the number of class names contained on the card are retrieved and
tree structuring commences. The first cell in the tree structure is called
a "header" and serves to identify the format name of the tree. Each
of the classes contained in the format definition is assigned to a data
cell and the cells are chained together. Figure 6 shows the structure
representing the format definition:

EMPLOYEE RECORD (NAME, ADDRESS, AGE, CHILDREN)
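
The chaining of data cells can be sketched in Python as follows. The
sketch rests on several assumptions: object references replace
storage addresses, None replaces the zero that marks a null link, the
TOP field holds the name itself rather than a pointer to it, and the
header is assumed to reach its first class through its DOWN link.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataCell:
    top: str                            # stands in for the TOP pointer
    right: Optional["DataCell"] = None  # next cell on the same level
    down: Optional["DataCell"] = None   # first cell on the level below

def build_format_tree(format_name, class_names):
    """Create a header cell and chain one cell per class beneath it."""
    header = DataCell(format_name)
    previous = None
    for name in class_names:
        cell = DataCell(name)
        if previous is None:
            header.down = cell     # first class hangs below the header
        else:
            previous.right = cell  # later classes chain to the right
        previous = cell
    return header

def find(cell, name):
    """Search a chain, and every level below it, for a named cell."""
    while cell is not None:
        if cell.top == name:
            return cell
        hit = find(cell.down, name)
        if hit is not None:
            return hit
        cell = cell.right
    return None

def define_class(tree, class_name, subclasses):
    """Append the cells of a class definition below the class's cell."""
    node = find(tree, class_name)
    node.down = build_format_tree(class_name, subclasses).down

def children(cell):
    """List the names chained directly below a cell."""
    names, child = [], cell.down
    while child is not None:
        names.append(child.top)
        child = child.right
    return names

tree = build_format_tree("EMPLOYEE RECORD",
                         ["NAME", "ADDRESS", "AGE", "CHILDREN"])
define_class(tree, "NAME", ["LAST", "FIRST"])
define_class(tree, "ADDRESS", ["STREET", "CITY", "STATE"])
print(children(tree))
print(children(find(tree, "ADDRESS")))
```

A cell whose DOWN link is empty, such as the one for AGE, is an end
node of the tree and is therefore a lowest-level class.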
Before completing the discussion of tree structuring it is important
to note that class definitions throughout the various universes of
discourse in the data base must be consistent. That is to say, if the
class called "NAME" is defined as (LAST, FIRST) then every occurrence
of "NAME" must consist of the classes "LAST" and "FIRST."
If this is not done, confusion arises during the retrieval process when
the processor attempts to identify the class memberships of data
elements. Therefore, as each format definition is read, a search is
conducted of all previously constructed trees to determine whether or
not each of the classes contained in the definition being processed
has been previously used. If a class has been previously used then
the tree structure representing the class is appended to the tree being
built. If a class has not been previously used then a class definition
card must be submitted to the tree-structuring processor.


Figure 5

Data Cell Composition (fields: TOP, RIGHT, DOWN)

NOTE: The numbers in the TOP fields are sequence numbers.


Figure 6

Tree Structure Composed of Data Cells (with header cell) for
EMPLOYEE RECORD (NAME, ADDRESS, AGE, CHILDREN)

The numbers in the TOP fields correspond to:

1 EMPLOYEE RECORD
2 NAME
3 ADDRESS
4 AGE
5 CHILDREN

After the format definition card has been processed, any class
definition cards associated with the structure are processed. Figure 7
contains a completed tree structure for:

EMPLOYEE RECORD (NAME, ADDRESS, AGE, CHILDREN)
NAME (LAST, FIRST)
ADDRESS (STREET, CITY, STATE)


C. INDEX FILES

The system incorporates an index file, called the master index,
which demonstrates many of the characteristics and advantages of an
inverted index. The master index contains format names, class names,
and data elements. Each entry in the index has a pointer associated
with it which links the entry to a tree structure, data record, or
further information concerning the entry. The retrieval process is
always initiated at the master index since it is the agent which directs
the search for information in response to a user's query.


Figure 7

Tree Structure for the Format "EMPLOYEE RECORD"

EMPLOYEE RECORD (NAME, ADDRESS, AGE, CHILDREN)
NAME (LAST, FIRST)
ADDRESS (STREET, CITY, STATE)

The numbers in the TOP fields correspond to:

1 EMPLOYEE RECORD     6 LAST
2 NAME                7 FIRST
3 ADDRESS             8 STREET
4 AGE                 9 CITY
5 CHILDREN           10 STATE


1. Characteristics of the Master Index
Conceptually, the master index is a large matrix consisting 

of fixed-length records (matrix rows), each containing eight fields 
(matrix columns), as shown in Figure 8. The first four characters of 
format names, class names, and data elements are stored in the first 
four fields of the index. Entries which contain more than four char- 
acters are then stored in a sequential storage area reserved for 
variable-length records. The remaining four fields of each index 
record contain information concerning the type of entry (e.g., format 
name, class name, or data element), the sequential store address of 
the full character representation of the entry, if any, pointers to infor- 
mation-bearing data cells, and other information useful to the retrieval 
processor. 

2. Constructing the Master Index

The first record of the master index is reserved as a table
of all format names contained in the data base. The first record
contains the address of the first data cell (identical to the data cells used
in tree structuring) in a chain of cells and each cell contains the
address of a format name located in sequential storage. Through this
record a user may quickly determine the partitioning of the data base.
Figure 9 demonstrates the idea.


Figure 8

Representation of the Master Index

COLUMN(S)
1-4 : First 4 characters of the entry
5   : "F" if the entry is a format name
      "C" if the entry is a class name
      "L" if the entry is the lowest level class in a
      tree structure
      "D" if the entry is a data element
6   : Pointer to the full character representation in
      sequential store
7   : Pointer to associated chain of data cells if the
      entry is classified "L", otherwise pointer to
      sequential store
8   : Pointer to associated data cell in the tree structure
      if the entry is a class or format.
      Pointer to associated chain of data cells if the entry
      is a data element.


Figure 9

Reserved Record in the Master Index for Format Names with
Associated Data Cells and Format Names in Sequential Store
(format names shown: EMPLOYEE RECORD, PAY RECORD, STOCK INVENTORY)


Format names are entered in the index and linked to their
definitions which are located in sequential storage. Each of the
classes contained in the format definitions is also stored in the index.
Associated with each class entry in the index is a string of data cells
which contain two items of information concerning the class: 

a. The first field contains the number of the data record which, 
in turn, contains an occurrence of the class. (This information is 


added when the data records are read and is discussed later. ) 


b. The second field contains a number corresponding to the for- 
mat name which contains this class entry. 


A class may be used in any number of different format definitions
but its structure must be consistent in every occurrence. Therefore,
regardless of the number of format definitions which contain a given
class, there is only one index record for the class. The data cells
appended to the class entry provide the retrieval processor with data
such as the format definitions in which the class appears. Among
other things, information pertaining to the class entries provides the
retrieval processor with the capability of quickly abandoning a search
when a user requests information through a class which is not a
member of the format being queried.

Class definitions are processed in a manner very similar to
format definition processing. The class being defined is entered in the
index and the definition is stored as read in the sequential store. The
system returns the sequential store address and enters it in the index
record. Appropriate data cells are appended to the index and the
class structure is added to the classificatory tree. When the tree is
completed, those classes which are end nodes in the classificatory
tree (e.g., LAST, FIRST, STREET, CITY, STATE, AGE, and
CHILDREN in EMPLOYEE RECORD) are identified and their index
records are flagged. This is done to ensure that elements in the data
records are mapped onto the tree structure according to their proper
class membership.

As each data record is read into the system it is assigned a
unique number and placed in the sequential store. Each element within
the record is examined to determine its class membership and the
master index is searched to determine if the element was previously
entered by another data record. The possibility of a data element
appearing in more than one record exists if the data base contains
similar formats such as employee records and pay records. In
addition, a data element may be a member of more than one class, such
as the occurrence of "JOHN" as a member of both classes "FIRST"
and "CHILDREN." It is highly desirable that there be only one entry
in the master index for those elements which occur more than once.
Unique entries in the index guarantee that when an item is located in
the index, the search process is complete and successful. Additionally,
the need for combined search plans is eliminated. Specific record
and class membership information for each data element entered in
the index is resolved by appending data cells to the master index entry.
The data cells contain the record number(s) from which the element
was extracted and its class membership(s). Assuming that a data
element occurs several times in the data base, the master index would
still contain only one record for the element. The record contains all
of the information pertinent to the retrieval process. The technique,
relevant to both class and data entries, results in two important
savings:

1. A significant reduction of storage space is realized (if an
element occurs several times) since multiple entries in the master
index require more storage space than a single record and its
associated data cells.

2. A significant reduction in search time is realized since multiple
entries require the retrieval processor to conduct a full-file search
each time it enters the master index.
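
The single-entry rule can be sketched in Python. The sketch is an
illustration under stated assumptions: a dictionary stands in for the
searched master index, a list of (record number, class name) pairs
stands in for the chain of appended data cells, and the sample
records are invented.

```python
def add_record(master_index, record_number, labelled_elements):
    """Enter each element once and append a cell for every occurrence."""
    for class_name, element in labelled_elements:
        # setdefault creates the index entry only on first occurrence.
        cells = master_index.setdefault(element, [])
        cells.append((record_number, class_name))

def records_with(master_index, element, class_name):
    """Record numbers in which the element occurs as the given class."""
    return [number for number, cls in master_index.get(element, [])
            if cls == class_name]

master_index = {}
add_record(master_index, 1, [("LAST", "DOE"), ("FIRST", "JOHN"),
                             ("CHILDREN", "MARY"), ("CHILDREN", "SALLY")])
add_record(master_index, 2, [("LAST", "SMITH"), ("FIRST", "PAUL"),
                             ("CHILDREN", "JOHN")])

print(master_index["JOHN"])
print(records_with(master_index, "JOHN", "CHILDREN"))
```

Locating "JOHN" once yields every record and class in which it
occurs, so no further search of the index is required.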

3. Data Record Table

Cells appended to each data element stored in the master
index do not contain the sequential store addresses of the records
from which the data elements were extracted. This information is
stored separately in a table referred to as a data record table. The
data record table augments the information contained in the master
index and is composed of fixed-length records as shown in Figure 10.
Each table record consists of three fields which contain: 

a. The unique data record number. 

b. Format membership of the data record. 

c. Sequential store address of the data record. 

The data record table serves two functions: 

a. The retrieval processor bypasses the master index and directly 
enters the data record table to satisfy requests for all data records 
which are members of a particular universe of discourse. 

b. The table is also utilized for queries other than those which
request "all data records." The retrieval processor searches the
master index to determine the data records which satisfy a user's
request. Then the processor enters the data record table and extracts
the sequential store addresses of the records. The sequential store
addresses are passed to the "output" section of the retrieval processor.


Figure 10

Representation of the Data Record Table

COLUMN
1 : Unique record number
2 : Format membership of the data record
3 : Pointer to data record in sequential store

FORMAT NUMBER    FORMAT NAME
1                EMPLOYEE RECORD
2                CAR REGISTRATION
3                PAY RECORD
4                STOCK INVENTORY


The information contained in the data record table is tabulated
separately from the master index to achieve savings in storage space
and response time. Storage savings are realized since the addresses
of data records in the sequential store are contained only in the data
record table and are not replicated in the master index for each class
and data element. System response time is reduced for queries that
request all data records of a particular universe of discourse since
the data record table was designed primarily to expedite this type of
request. The retrieval processor extracts all of the necessary data
record addresses in one access of the table. The amount of searching
within the table is minimal.
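
The two functions of the data record table can be sketched in Python.
The list-based table, the list index used as a sequential store
address, and the sample records (including the layout of the pay
record) are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical sequential store; the list position acts as the address.
sequential_store = [
    "((DOE, JOHN), (203 ELM STREET, MONTEREY, CA.), (48), (MARY, SALLY))",
    "((DOE, JOHN), (1200.00))",
    "((SMITH, PAUL), (12 OAK AVENUE, MARINA, CA.), (35), (JOHN))",
]

# Each row: (record number, format number, sequential store address).
data_record_table = [(1, 1, 0), (2, 3, 1), (3, 1, 2)]

def all_records_of_format(format_number):
    """Function a: bypass the master index for 'all data records'."""
    return [sequential_store[address]
            for _number, fmt, address in data_record_table
            if fmt == format_number]

def fetch_records(record_numbers):
    """Function b: turn record numbers found via the index into text."""
    addresses = {number: address
                 for number, _fmt, address in data_record_table}
    return [sequential_store[addresses[n]] for n in record_numbers]

print(all_records_of_format(1))  # every record of format number 1
print(fetch_records([2]))        # a record located through the index
```

Keeping the addresses in this one table means an index cell need
carry only a record number, which is the storage saving described
above.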


D. INFORMATION FILE

The "sequential store" is the system information file, or data
base. It contains the data records, format definitions, class definitions,
and the full character representation of those entries in the master
index consisting of more than four characters. Figure 11 shows the
sequential store and its relationship to the master index and the data
record table.

The information file is resident in main core storage. The
variable-length records of this file are sequentially ordered. System
information files are not normally stored in main core unless they are
relatively small (which is the case here). However, it is imperative
that such a file be resident on a direct access storage device in order
to provide satisfactory system response time.


Figure 11

Relationship between the Master Index, Data Record Table, and
Sequential Store


E. RETRIEVAL PROCESSOR
The retrieval processor is divided into three operations. The
identification operation determines the type of query posed by the
user; the search operation determines the data record numbers which
satisfy the user's request; the output operation retrieves the resultant
data records from the sequential store and prints them at the terminal.
Additionally, special messages are output to the user in the form of
error messages to warn him of invalid queries, and messages which
notify him of unsatisfied queries.
1. Query Types 
The IR system designer strives to achieve total utility of the 
system by providing the user with a powerful retrieval language. 
Utility of the data structure used in this system is realized by the 
various types of queries available to the user for extracting informa- 
tion from the data base. There are four major types of queries avail- 
able to the user. 
a. Determining Data Base Partitions. 
As previously discussed, the data base may be partitioned
to allow a mix of unrelated information by defining the class structure
of each universe of discourse in the data base. A user who is
unfamiliar with the data base partitions (format names) may easily
determine this information by submitting a special type of query. The
format of the query is simple and consists of the single search key:
"CLASS." This is translated by the retrieval processor as: "Output
the names of all formats contained in the data base." Search of the
master index is then centered at the first record of the index and its
associated chain of data cells which contain the sequential store
addresses of the format names. All format names contained in the data
base are output to the user.

QUERY:    CLASS
RESPONSE: EMPLOYEE RECORD
          PAY RECORD


b. Determining Format and Class Definitions. 

In order to extract data from a specific universe of discourse, 
the user must be provided with its class structure. The class structure 
determines the format for data record requests. Queries of format 
and class definitions must contain, as a search key, the format name 
or class name to be defined. The search processor enters the master 
index to locate the format name or class name, extracts the address 
of its definition located in the sequential store, and outputs the
definition directly at the terminal.

QUERY: EMPLOYEE RECORD
RESPONSE: (NAME, ADDRESS, AGE, CHILDREN)

QUERY: NAME
RESPONSE: (LAST, FIRST)

QUERY: AGE
RESPONSE: NO DESCENDANTS
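
The partition and definition queries described above can be sketched as a
simple lookup. This is a minimal illustration only: the dictionary layout,
the function name answer_definition_query, and the wording returned for an
unknown search key are assumptions, not the system's actual storage format.

```python
# Hypothetical in-memory stand-ins for the sequential store and master index.
# All names and the layout are assumptions made for illustration.
SEQUENTIAL_STORE = {
    0: "EMPLOYEE RECORD",
    1: ("NAME", "ADDRESS", "AGE", "CHILDREN"),
    2: ("LAST", "FIRST"),
}

# Stand-in for the first index record: the chain of sequential store
# addresses that hold the format names.
FORMAT_NAME_ADDRESSES = [0]

# Stand-in for master index entries: a format or class name maps to the
# store address of its definition. None marks a lowest-level class.
DEFINITION_ADDRESS = {
    "EMPLOYEE RECORD": 1,
    "NAME": 2,
    "AGE": None,
}


def answer_definition_query(search_key):
    """Return the response text for a partition or definition query."""
    if search_key == "CLASS":
        # Special query: list every format name in the data base.
        names = [SEQUENTIAL_STORE[a] for a in FORMAT_NAME_ADDRESSES]
        return "\n".join(names)
    if search_key not in DEFINITION_ADDRESS:
        return "INVALID QUERY"  # assumed wording for an unknown key
    address = DEFINITION_ADDRESS[search_key]
    if address is None:
        return "NO DESCENDANTS"
    return "(" + ", ".join(SEQUENTIAL_STORE[address]) + ")"


print(answer_definition_query("EMPLOYEE RECORD"))
print(answer_definition_query("AGE"))
```

The single search key is enough in both cases because the master index
already holds the address of every definition.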


c. Data Element Retrieval. 

One asset of the data structure concept is that it allows the
user to extract single data elements from the data base which are
members of a particular class and format, or members of a particular
class irrespective of format membership. Since data elements
are mapped onto the end nodes of their respective tree structures, the
user must use the lowest level classes of the structure as search keys.
Failure to do so prompts the retrieval processor to output corrective
information to the user. The hyphens in the queries below indicate to
the retrieval processor that the expressions are queries and not
format definitions. The processor could identify the expression by
searching the master index for an occurrence of "EMPLOYEE RECORD."
A successful search would indicate that a format definition already
existed in the system. However, use of the hyphen is a simpler and
faster method for positively identifying the type of expression
submitted to the system.
QUERY: EMPLOYEE RECORD (NAME, --)
RESPONSE: INVALID QUERY -
          DETERMINE DESCENDANTS OF: NAME
          USE DESCENDANTS AS KEYWORDS

QUERY: EMPLOYEE RECORD (LAST, --)
RESPONSE: BROWN
          SMITH
          THOMPSON

To answer the above query, a search is conducted in the master
index for all data elements which are members of the class "LAST"
and are members of the format "EMPLOYEE RECORD." This information
is contained in the data cells appended to each data entry in the
index. Elements which satisfy the query are taken directly from the
master index and output at the terminal.

In the query below, the hyphen is used to differentiate between a
query and a class definition statement. All data elements which are
members of the class "LAST" are output irrespective of format
membership. The format membership fields of the data cells are
ignored during the search of the master index.

QUERY: LAST (--)
RESPONSE: BROWN
          CHAMBERS
          COLTEE
          DOE
          SMITH
          THOMPSON
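
The two forms of data element retrieval can be sketched as one filter over
the master index, with the format test applied only when a format name is
given. This is a minimal illustration: the tuple layout, the function name
retrieve_elements, the element "JOHN," and the format "OTHER RECORD" are
assumptions introduced for the example and do not come from the system.

```python
# Each master index entry is modelled as (element, class name, format name).
# In the system this membership information is held in data cells appended
# to each index entry; the flat tuple layout here is an assumption.
MASTER_INDEX = [
    ("BROWN", "LAST", "EMPLOYEE RECORD"),
    ("CHAMBERS", "LAST", "OTHER RECORD"),      # hypothetical second format
    ("DOE", "LAST", "OTHER RECORD"),
    ("JOHN", "FIRST", "EMPLOYEE RECORD"),      # hypothetical element
    ("SMITH", "LAST", "EMPLOYEE RECORD"),
    ("THOMPSON", "LAST", "EMPLOYEE RECORD"),
]


def retrieve_elements(class_name, format_name=None):
    """Return elements of a class, optionally restricted to one format.

    When format_name is None the format membership field is ignored,
    which corresponds to a query submitted without a format name.
    """
    matches = {
        element
        for element, cls, fmt in MASTER_INDEX
        if cls == class_name and (format_name is None or fmt == format_name)
    }
    return sorted(matches)


print(retrieve_elements("LAST", "EMPLOYEE RECORD"))
print(retrieve_elements("LAST"))
```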
d. Data Record Retrieval.

Data record retrieval is the most valuable type of request available
to the system user and would be the most frequently used. It extracts
complete data records which satisfy the search keys contained in the
query. To retrieve data records, the queries contain data elements as
search keys and may contain Boolean "AND," alphabetic and/or numeric
ranges, or any combination thereof. The query format is a fully
parenthesized expression as shown in previous sections. Search keys
are positioned in the expression with respect to class membership,
and hyphens are inserted in those positions for which information is
requested. Any variation from the properly parenthesized expression
prompts error messages from the retrieval processor to the user.

The retrieval process for the query listed below is explained in
the following paragraphs:

EMPLOYEE RECORD ((DOE, --), (--, --, CA), (--), (--))

The format name appearing at the beginning of the query expression
informs the retrieval processor of the universe of discourse in
which the user is interested. The processor then traverses the tree
structure for "EMPLOYEE RECORD" to determine the lowest level
classes in the tree. This information, in conjunction with the proper
use of parentheses in the query expression, allows the processor to
identify the class memberships of the search keys contained in the
query. The user is notified whenever the processor is unable to find a
search key in the master index. In this case, the processor attempts
to recover data which satisfies the remaining search keys. Similar
action takes place when the processor encounters a search key which
is not a member of the class specified in the query, or if a search key
is not a member of the format specified in the query. Additionally,
the user is notified whenever the query is improperly formatted.
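
The step of matching search keys to classes can be sketched as a depth-first
traversal that lists the lowest level classes, followed by a positional
pairing with the items of the query. This is a rough illustration under
stated assumptions: the descendants of ADDRESS (STREET, CITY, STATE) are not
given in the text and are invented here, the hyphen is written as "--", and
the parenthesis check only counts delimiters rather than verifying nesting.

```python
# Class structure as nested definitions. The descendants of ADDRESS are an
# assumption; the other entries follow the definitions shown earlier.
CLASS_TREE = {
    "EMPLOYEE RECORD": ["NAME", "ADDRESS", "AGE", "CHILDREN"],
    "NAME": ["LAST", "FIRST"],
    "ADDRESS": ["STREET", "CITY", "STATE"],
}


def leaf_classes(name):
    """Traverse the tree depth-first and list the lowest level classes."""
    children = CLASS_TREE.get(name)
    if not children:
        return [name]
    leaves = []
    for child in children:
        leaves.extend(leaf_classes(child))
    return leaves


def search_keys(format_name, expression):
    """Pair each non-hyphen item of a parenthesized query with its class."""
    # Loose check: equal numbers of opening and closing parentheses.
    if expression.count("(") != expression.count(")"):
        raise ValueError("improperly formatted query")
    flat = expression.replace("(", " ").replace(")", " ")
    items = [item.strip() for item in flat.split(",")]
    leaves = leaf_classes(format_name)
    # One item is expected for every lowest level class, in tree order.
    if len(items) != len(leaves):
        raise ValueError("improperly formatted query")
    return {cls: key for cls, key in zip(leaves, items) if key != "--"}


query = "((DOE, --), (--, --, CA), (--), (--))"
print(search_keys("EMPLOYEE RECORD", query))
```

A fuller version would also confirm that each key is a member of the class
and format it is paired with, and report the keys that fail.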

Each search key in the query is processed sequentially. The
retrieval processor searches the master index for an occurrence of
each key. Record numbers which contain an occurrence of the search
key are extracted and stored in a list. After all search keys have been
processed, the retrieval processor "ANDs" the record numbers in the
lists to determine which records satisfy the query. For example,
assuming that two key words are used and record numbers 5, 32, and
67 satisfy the first key word, and record numbers 32 and 67 satisfy
the second key word, records 32 and 67 are output to the user. Record
numbers which satisfy the query are passed to the "output" section
of the retrieval processor, which retrieves the sequential store
addresses of the records from the data record table and prints the
records at the terminal.
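
The "AND" step amounts to intersecting the record number lists collected
for each search key. The sketch below uses the record numbers from the
example above; the key names DOE and CA, the dictionary standing in for the
master index, and the function name matching_records are assumptions made
for illustration.

```python
# Hypothetical master index: search key -> record numbers containing it.
MASTER_INDEX = {
    "DOE": [5, 32, 67],   # first key word in the example
    "CA": [32, 67],       # second key word in the example
}


def matching_records(keys):
    """Intersect the record lists of all keys found in the index.

    Keys missing from the index are reported separately, and the search
    continues with the remaining keys.
    """
    found, missing = [], []
    for key in keys:
        if key in MASTER_INDEX:
            found.append(set(MASTER_INDEX[key]))
        else:
            missing.append(key)
    records = sorted(set.intersection(*found)) if found else []
    return records, missing


records, missing = matching_records(["DOE", "CA"])
print(records)   # record numbers passed on to the output section
print(missing)   # keys the user would be notified about
```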

A user has the ability to immediately examine the results of his
query since the system is interactive. The results of one query may
prompt the user to submit another request, either broadening or
narrowing the request through judicious use of search keys. In any
case, the user is guaranteed that if the information that he seeks is
contained in the data base, he will have quick and easy access to it.
Appendix A contains a sample run of the fact-retrieval system and
demonstrates all of the queries available to a user and the system
responses.


ALTERING THE DATA BASE 
1. Changes and Deletions 
Due to the experimental nature of the system, no utility 
routines have been provided for deleting records or making changes 
to existing records. Alterations are accomplished by manually 
changing the card images in the data files. 
2. Additions 
The addition of data records to existing data sets or the submission
of new universes of discourse is accomplished easily, without
special utility routines. This feature is inherently built into
the system through the data structuring technique. Addition of a new
universe of discourse is accomplished by submitting format and class
definitions and associated data records, either on-line through the
terminal (automatically) or off-line with card images (manually). New
data records may also be added to existing data files automatically or
manually.







VIII. CONCLUSIONS 


Characteristics of the data structuring concept as used in a
general-purpose fact-retrieval system have been discussed throughout
the preceding sections. These concepts are summarized here.

The data structuring technique encompasses the concept of
hierarchical classification, which is the most widely used method of
indexing. Hierarchical classification of data is a relatively simple
technique to use but possesses the power to divide and subdivide a
universe of discourse into more specific subjects. Additionally,
hierarchical structures may be created to include a domain of subjects.
This is advantageous for use in a fact-retrieval system, as previously
demonstrated, by providing a mix of structures in a single data base.
Therefore, users with differing interests are provided simultaneous
access to a single system, since each is provided a "personal" retrieval
system within a larger retrieval system. In addition, the hierarchical
structure provides a user with multiple avenues of access into his
information file.

Parenthesized expressions serve as an intermediate language 
between the query processor and the information retrieval system. 
The query processor is able to determine the class memberships of 
elements within an expression by examination of the parenthesized 
form. It is apparent, however, that the use of parenthesized
expressions is cumbersome and demanding, since misplacing parentheses
is easy to do and causes loss of meaning of the expression. On the other
hand, it can be argued that the technique of parenthesizing expressions
is powerful and that an equally powerful substitute is difficult to
conceive.